Coronavirus has obviously had a tremendous impact on our day to day lives; as has been the case with many countries, leaders have tried to balance the lives of people against the economic status and growth of the nation. We felt that one good way to measure the economic impact of coronavirus is through stocks - specifically the S&P 500. The reason we chose the S&P 500 is because we feel it has a good range of stocks. Particularly, we feel that it is a good measure because a significant portion of the S&P 500 is Big Tech, which has proven resilient to the coronavirus and thus reflects not only the general impact of covid but also of these tech companies. By reading the different factors and the correlated stock prices, we hope to be able to predict and find the extent to which these factors impact the stock market. The purpose of including other economic indicators was to gain a better personal understanding of how these indicators affect our market in general.
What is this data?
This data derives from analyzing various datasets regarding the US stock market, COVID-19 data (both US and China), and national unemployment and political attributes. We accumulated these to be evenly distributed across a period of 8 years, disregarding national holidays (which stock markets are closed on).
Where did it come from/ Why is it interesting to us?
Stock Market data
https://finance.yahoo.com/quote/SPY/history?p=SPY
This is the data we are trying to predict the outcome of since we believe the S&P 500 is a good reflection of the economy in general.
COVID-19 data
https://ourworldindata.org/coronavirus-source-data
We want to use this data since coronavirus is a big part of our life as of current and we know that it has and has had a significant impact on the economy. Using this information could give us key insights into how much COVID has impacted us.
VIX (Volatility Index) data
https://finance.yahoo.com/quote/%5EVIX/history?p=%5EVIX
Investors use the volatility index as a measure to see how much fear or risk there is in the market expectation for the future. By tracking VIX, which investors can use to make broader market decisions, we might be able to predict deeper drops or steeper rises in stock prices.
Unemployment Rates data https://beta.bls.gov/dataViewer/view/timeseries/LNS14000000;jsessionid=A7DA95FF6205EEF162A8031A92AE90AB
Unemployment rate, when reported and is high, generally has a negative impact on the economy as it is a sign that a particular economy isn’t doing well enough to provide enough jobs to the labor market. This would reflect in the resulting investment into that economy’s existing companies that are supposed to provide said jobs.
Political Data
https://www.loc.gov/rr/print/list/057_chron.html
One of the key topics discussed in politics is the economic positions of each party. Faith in the positive economic outcome of one ideology could bolster investment, while a belief that an ideology could be harmful to the economy could result in preemptive withdrawals of investments.
Fiscal Data
https://fred.stlouisfed.org/series/GDP
This is a result showing how the economy of a country is doing overall over a single quarter. High increases in quarterly GDP indicates that the economy of that country is growing fast and could be invested in to yield some of the returns of a growing economy; low increases might indicate that the economy is stagnant and therefore might be unappealing to investors.
What are the variables?
Dates
S&P 500 ETF (SPY) Closing Price
S&P 500 ETF (SPY) Daily Volume
New Deaths - USA
New Cases - USA
New Deaths - China
New Cases - China
New Tests - USA
VIX Open
VIX Close
US Monthly Unemployment Rate
US Sitting President
US Congress Majority Party
US Quarterly GDP
What is the goal of this model?
Ultimately, we want to find which of these predictor variables are most useful for predicting SPY’s closing price and what interactions between these predictors will yield us the most impactful model.
First, we load in our data and convert our categorical variables from character types to format types
# reading in our data file
spy500_predictors = read.csv("spy500_predictors.csv")
# These are our columns.
# We are using the predictors: SPY-Volume, US New Daily Cases, US New Daily Deaths, US New Daily Test, China New Daily Cases, China New Daily Deaths, US Unemployment Rate, US Quarterly GDP, sitting President of US, US Senate majority party, VIX (Volatility Index) Open, VIX Close
colnames(spy500_predictors)## [1] "Date" "SPY...Close"
## [3] "SPY...Volume" "US.New.Daily.Cases"
## [5] "US.New.Daily.Deaths" "US.New.Daily.Tests"
## [7] "China.New.Daily.Cases" "China.New.Daily.Deaths"
## [9] "US.Unemployment.Rate...." "US.Quarterly.GDP"
## [11] "Sitting.President" "Senate.Majority.Party"
## [13] "VIX...Open" "VIX...Close"
## [15] "Season"
Next, we create plots to observe the general correlation of each of our predictors to the closing price of SPY.
# Graphical Representation of the response variable vs. each of our predictor variables
par(mfrow=c(2,2))
plot(SPY...Close~1, pch = 20, data = spy500_predictors, xlab = "Days past 8/27/12", ylab = "SPY Close", main = "SPY Close Null Model") # this is our null model
plot(SPY...Close~US.Quarterly.GDP,pch = 20, data = spy500_predictors, xlab = "US Quarterly GDP (in $)", ylab = "SPY Close", main = "US Quarterly GDP vs SPY Close")
plot(SPY...Close~US.New.Daily.Cases,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Cases", ylab = "SPY Close", main = "US New Daily Covid Cases vs SPY Close", ylim = c(200, 400))
plot(SPY...Close~SPY...Volume,pch = 20, data = spy500_predictors,xlab = "SPY Volume", ylab = "SPY Close", main = "SPY Volume vs SPY Close")plot(SPY...Close~US.New.Daily.Tests,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Tests", ylab = "SPY Close", main = "US New Daily Covid Tests vs SPY Close")
plot(SPY...Close~US.New.Daily.Deaths,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Deaths", ylab = "SPY Close", main = "US New Daily Covid Deaths vs SPY Close")
plot(SPY...Close~China.New.Daily.Cases,pch = 20, data = spy500_predictors, xlab = "China New Daily Covid Cases", ylab = "SPY Close", main = "China New Daily Covid Cases vs SPY Close")
plot(SPY...Close~China.New.Daily.Deaths,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Cases", ylab = "SPY Close", main = "US New Daily Covid Cases vs SPY Close")plot(SPY...Close~US.Unemployment.Rate....,pch = 20, data = spy500_predictors, xlab = "US Unemployment Rate", ylab = "SPY Close", main = "US Unemployment Rate vs SPY Close")
plot(SPY...Close~VIX...Open,pch = 20, data = spy500_predictors, xlab = "VIX Open", ylab = "SPY Close", main = "Vix Open vs SPY Close")
plot(SPY...Close~VIX...Close,pch = 20, data = spy500_predictors, xlab = "VIX Close", ylab = "SPY Close", main = "Vix Close vs SPY Close")
plot(SPY...Close~Senate.Majority.Party, data = spy500_predictors, xlab = "Senate Majority Party (factored)", ylab = "SPY Close", main = "Senate Majority Party vs SPY Close")plot(SPY...Close~Sitting.President, data = spy500_predictors, xlab = "Sitting President (factored)", ylab = "SPY Close", main = "Sitting President vs SPY Close")Based on what we see in the graphs, we can observe that the GPD seems to be positively correlated with the SPY 500’s closing price and that the unemployment rate is generally negatively correlated (as expected). Ironically, it seems as though the the number of new daily COVID cases has had a positive impact on closing price as there was widespread panic when the US was at relatively low COVID cases and the economy now seems to have made a full recovery despite COVID cases being higher than ever. Thankfully however, we can observe that deaths still negatively impacted the price of SPY, and the disconnect between deaths and cases can be attributed to the lowering death rate as cases increased. We also see that the number of tests conducted per day appears to have a relatively strong positive correlation to the stock price.
Indicators like volume and the VIX are actually found to be less correlated than we had expected but must be further tested to truly observe their significance. Also, the data from China seems to have relatively little correlation.
One flaw we discovered while working on this project, which highlighted the ability of statistics to sometimes obscure the truth, was that based on the data frame we chose, it would appear that Donald Trump and the Republican Senate majority have overall been correlated with a higher SPY price over the last 8 years. However, what this fails to account for is that fact that the price of the SPY 500 goes up overall over time, so the last 8 years would be a very biased time frame with the Republican Senate and Donald Trump being associated with the $300+ SPY close priced of 2019 and beyond.
rmse = function(m) {
sqrt(mean(resid(m) ^ 2))
}
volume_model = lm(SPY...Close ~ SPY...Volume, data = spy500_predictors)
print("volume_model: Adj R-squared, R-Squared, RMSE")
summary(volume_model)$adj.r.squared
summary(volume_model)$r.squared
rmse(volume_model)
US_daily_cases_model = lm(SPY...Close ~ US.New.Daily.Cases, data = spy500_predictors)
print("US_daily_cases_model: ")
summary(US_daily_cases_model)$adj.r.squared
summary(US_daily_cases_model)$r.squared
rmse(US_daily_cases_model)
China_daily_cases_model = lm(SPY...Close ~ China.New.Daily.Cases, data = spy500_predictors)
print("China_daily_cases_model:")
summary(China_daily_cases_model)$adj.r.squared
summary(China_daily_cases_model)$r.squared
rmse(China_daily_cases_model)
US_unemployment_rate_model = lm(SPY...Close ~ US.Unemployment.Rate...., data = spy500_predictors)
print("US_unemployment_rate_model: Adj R-squared, R-Squared, RMSE")
summary(US_unemployment_rate_model)$adj.r.squared
summary(US_unemployment_rate_model)$r.squared
rmse(US_unemployment_rate_model)
US_GDP_model = lm(SPY...Close ~ US.Quarterly.GDP, data = spy500_predictors)
print("US_GDP_model: Adj R-squared, R-Squared, RMSE")
summary(US_GDP_model)$adj.r.squared
summary(US_GDP_model)$r.squared
rmse(US_GDP_model)
US_president_model = lm(SPY...Close ~ Sitting.President, data = spy500_predictors)
print("US_president_model: Adj R-squared, R-Squared, RMSE")
summary(US_president_model)$adj.r.squared
summary(US_president_model)$r.squared
rmse(US_president_model)
US_senate_majority_model = lm(SPY...Close ~ Senate.Majority.Party, data = spy500_predictors)
print("US_senate_majority_model: Adj R-squared, R-Squared, RMSE")
summary(US_senate_majority_model)$adj.r.squared
summary(US_senate_majority_model)$r.squared
rmse(US_senate_majority_model)
VIX_open_model = lm(SPY...Close ~ VIX...Open, data = spy500_predictors)
print("VIX_open_model: Adj R-squared, R-Squared, RMSE")
summary(VIX_open_model)$adj.r.squared
summary(VIX_open_model)$r.squared
rmse(VIX_open_model)
VIX_close_model = lm(SPY...Close ~ VIX...Close, data = spy500_predictors)
print("VIX_close_model: Adj R-squared, R-Squared, RMSE")
summary(VIX_close_model)$adj.r.squared
summary(VIX_close_model)$r.squared
rmse(VIX_close_model)
season_model = lm(SPY...Close ~ Season, data = spy500_predictors)
print("season_model: Adj R-squared, R-Squared, RMSE")
summary(season_model)$adj.r.squared
summary(season_model)$r.squared
rmse(season_model)## [1] "volume_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.1250146
## [1] 0.1254382
## [1] 49.23804
## [1] "US_daily_cases_model: "
## [1] 0.1991564
## [1] 0.199544
## [1] 47.10578
## [1] "China_daily_cases_model:"
## [1] 0.02147443
## [1] 0.02194806
## [1] 52.06987
## [1] "US_unemployment_rate_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.06777032
## [1] 0.06822154
## [1] 50.82318
## [1] "US_GDP_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.8951023
## [1] 0.8951531
## [1] 17.04839
## [1] "US_president_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.7421466
## [1] 0.7422714
## [1] 26.72925
## [1] "US_senate_majority_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.4924407
## [1] 0.4926864
## [1] 37.50108
## [1] "VIX_open_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.05579382
## [1] 0.05625084
## [1] 51.14861
## [1] "VIX_close_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.0530402
## [1] 0.05349855
## [1] 51.22314
## [1] "season_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.01239208
## [1] 0.01382617
## [1] 52.28562
This R^2, Adjusted R^2, and RMSE of each single predictor linear model are shown and we used these values to gain some intuition as to which predictors we wanted to model.
Now that we have some basic understanding of the predictor variables being used and their general correlation visualized along with some statistics to quantify the correlation, we will begin our statistical modeling with a naive additive model which uses all of the predictor variables.
# The null model against which to compare our additive model in Anova Testing
null_model = lm(SPY...Close ~ 1, data = spy500_predictors)
# Creating our naive additive model which uses all of our predictors
naive_additive_model_full = lm(SPY...Close ~ SPY...Volume + US.New.Daily.Cases + US.New.Daily.Deaths + US.New.Daily.Tests + China.New.Daily.Cases + China.New.Daily.Deaths + US.Unemployment.Rate.... + US.Quarterly.GDP + Sitting.President + Senate.Majority.Party + VIX...Open + VIX...Close + Season, data = spy500_predictors)
# All assumptions of regression model seem to have been met based on these assumption checks
par(mfrow=c(2,2))
plot(naive_additive_model_full)par(mfrow=c(1,1))
# Checking the results of our first naive additive linear regression model
predictions = predict(naive_additive_model_full)
plot(spy500_predictors$SPY...Close ~ 1)
lines(predictions ~ 1, col = "green")##
## Call:
## lm(formula = SPY...Close ~ SPY...Volume + US.New.Daily.Cases +
## US.New.Daily.Deaths + US.New.Daily.Tests + China.New.Daily.Cases +
## China.New.Daily.Deaths + US.Unemployment.Rate.... + US.Quarterly.GDP +
## Sitting.President + Senate.Majority.Party + VIX...Open +
## VIX...Close + Season, data = spy500_predictors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.016 -5.968 -0.397 6.027 55.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.737e+02 8.991e+00 -41.562 < 2e-16 ***
## SPY...Volume 7.732e-09 6.379e-09 1.212 0.225615
## US.New.Daily.Cases -1.366e-04 4.706e-05 -2.904 0.003728 **
## US.New.Daily.Deaths 5.058e-03 1.395e-03 3.625 0.000296 ***
## US.New.Daily.Tests 4.619e-05 2.667e-06 17.320 < 2e-16 ***
## China.New.Daily.Cases 4.815e-03 5.506e-04 8.745 < 2e-16 ***
## China.New.Daily.Deaths 2.055e-02 7.454e-03 2.756 0.005896 **
## US.Unemployment.Rate.... 4.433e+00 2.933e-01 15.114 < 2e-16 ***
## US.Quarterly.GDP 3.118e-02 4.536e-04 68.745 < 2e-16 ***
## Sitting.PresidentDonald Trump 3.688e+00 9.546e-01 3.863 0.000115 ***
## Senate.Majority.PartyRepublican -1.434e+00 8.068e-01 -1.778 0.075632 .
## VIX...Open -4.945e-01 1.357e-01 -3.643 0.000276 ***
## VIX...Close -6.364e-01 1.424e-01 -4.470 8.26e-06 ***
## SeasonSpring 1.500e+00 6.135e-01 2.444 0.014592 *
## SeasonSummer 4.811e-01 5.932e-01 0.811 0.417478
## SeasonWinter 2.240e+00 6.262e-01 3.577 0.000356 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.612 on 2051 degrees of freedom
## Multiple R-squared: 0.9669, Adjusted R-squared: 0.9667
## F-statistic: 3998 on 15 and 2051 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: SPY...Close ~ 1
## Model 2: SPY...Close ~ SPY...Volume + US.New.Daily.Cases + US.New.Daily.Deaths +
## US.New.Daily.Tests + China.New.Daily.Cases + China.New.Daily.Deaths +
## US.Unemployment.Rate.... + US.Quarterly.GDP + Sitting.President +
## Senate.Majority.Party + VIX...Open + VIX...Close + Season
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2066 5729959
## 2 2051 189480 15 5540479 3998.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on what we observe in the summary of the additive model, the anova tests, and the plot, we have a decent model with a relatively good R^2 value which we can continue to build open. We further confirmed that all of the predictors we were using were significant by individually removing them from the overall model and performing ANOVA tests to conclude that the model would be significantly different with a lower adjusted R-squared value if we removed any one of the predictors. This meant that our full additive model was the best one we had up to this point.
Here we intentionally created an over-fitted model so that we could figure out what polynomial of each predictor variable would be best for our model and then went ahead and chose the appropriate polynomials.
overfitted_model = lm(SPY...Close ~ poly(SPY...Volume, 10) + poly(US.New.Daily.Cases, 10) + poly(US.New.Daily.Deaths, 10) + poly(US.New.Daily.Tests, 10) + poly(China.New.Daily.Cases, 1) + poly(China.New.Daily.Deaths, 1) + poly(US.Unemployment.Rate...., 10) + poly(US.Quarterly.GDP, 10) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 10) + Season, data = spy500_predictors)
summary(overfitted_model)$adj.r.squared## [1] 0.9894362
## [1] 5.309677
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow=c(1,1))
predictions = predict(overfitted_model)
plot(spy500_predictors$SPY...Close ~ 1)
lines(predictions ~ 1, col = "green")As we can see, our over-fitted model violates some assumptions and is clearly over-fitted in the prediction graph.
Now that we had figured out which polynomials of predictors to use we were ready to test various interactions until we found the most ideal model for our purpose.
# all assumptions met
adjusted_polynomial_add_model = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 5) + Season, data = spy500_predictors)
summary(adjusted_polynomial_add_model)$adj.r.squared## [1] 0.9852856
## [1] 6.33082
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow=c(1,1))
# all assumptions met - except a few influential points
adjusted_polynomial_interactive_model = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 5) + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model)$adj.r.squared## [1] 0.9854494
## [1] 6.272174
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow=c(1,1))
# Very tighly fit model violating assumptions
adjusted_polynomial_interactive_model_2 = lm(SPY...Close ~ poly(SPY...Volume, 2)*poly(VIX...Open, 5) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3)*poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_2)$adj.r.squared## [1] 0.9875328
## [1] 5.723089
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow=c(1,1))
# another model violating assumptions
adjusted_polynomial_interactive_model_3 = lm(SPY...Close ~ poly(SPY...Volume, 2)*poly(VIX...Open, 5) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3)*poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6)*poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_3)$adj.r.squared## [1] 0.9942734
## [1] 3.84295
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow=c(1,1))
# Here is our final model with only has less than 5 influential points out of over 2000
adjusted_polynomial_interactive_model_4 = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(VIX...Open, 5) + I(SPY...Volume ^ 2)*I(VIX...Open^5) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + I(US.New.Daily.Cases^5)*I(US.New.Daily.Deaths^3)*I(US.New.Daily.Tests^2) + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + I(US.Unemployment.Rate.... ^ 6)*I(US.Quarterly.GDP^6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_4)$adj.r.squared## [1] 0.9865593
## [1] 6.038683
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Analysis of Variance Table
##
## Model 1: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases,
## 5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests,
## 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate....,
## 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party +
## poly(VIX...Open, 5) + Season
## Model 2: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases,
## 5) * poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests,
## 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate....,
## 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party +
## poly(VIX...Open, 5) + Season
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2030 82844
## 2 2015 81316 15 1527.8 2.5238 0.001024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases,
## 5) * poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests,
## 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate....,
## 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party +
## poly(VIX...Open, 5) + Season
## Model 2: SPY...Close ~ poly(SPY...Volume, 2) + poly(VIX...Open, 5) + I(SPY...Volume^2) *
## I(VIX...Open^5) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths,
## 3) + poly(US.New.Daily.Tests, 2) + I(US.New.Daily.Cases^5) *
## I(US.New.Daily.Deaths^3) * I(US.New.Daily.Tests^2) + China.New.Daily.Cases *
## China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) +
## poly(US.Quarterly.GDP, 6) + I(US.Unemployment.Rate....^6) *
## I(US.Quarterly.GDP^6) + Sitting.President + Senate.Majority.Party +
## Season
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2015 81316
## 2 2022 75375 -7 5941.5
Based on our results, the best model we have generated so far is the adjusted_polynomial_interactive_model_2. We will not check how effective this model is at predicting the same time series.
predictions = predict(adjusted_polynomial_interactive_model_4)
plot(spy500_predictors$SPY...Close ~ 1, xlab = "Days past 8/27/12", ylab = "SPY Close",)
lines(predictions ~ 1, col = "green")The final model which we select appears to do a rather good job of modeling the actual price of the SPY 500. As we can see, there have recently been times when the SPY 500 overreacts to corona virus scares, but overall, the economic indicators and COVID measures chosen seem to be fairly correlated with the stock’s price.
We have chosen adjusted_polynomial_interactive_model_4 as our final model. However, we will see that there is an issue.
par(mfrow = c(1,2))
plot(adjusted_polynomial_interactive_model_4, which = 1, col = "dodgerblue", pch = 20)
plot(adjusted_polynomial_interactive_model_4, which = 2, col = "dodgerblue", pch = 20)adjusted_polynomial_interactive_model_4 does not violate the constant variance, linearity, or normality. While our Q-Q plot tapers off slightly, it is due to the enormous amount of data points we consider.
## poly(SPY...Volume, 2)1
## 2.452963e+00
## poly(SPY...Volume, 2)2
## 1.486178e+00
## poly(VIX...Open, 5)1
## 1.593967e+01
## poly(VIX...Open, 5)2
## 1.708614e+01
## poly(VIX...Open, 5)3
## 4.296714e+00
## poly(VIX...Open, 5)4
## 1.854856e+00
## poly(VIX...Open, 5)5
## 2.019789e+00
## I(SPY...Volume^2)
## 7.308237e+37
## I(VIX...Open^5)
## 4.778595e+21
## poly(US.New.Daily.Cases, 5)1
## 4.210624e+01
## poly(US.New.Daily.Cases, 5)2
## 1.046380e+01
## poly(US.New.Daily.Cases, 5)3
## 5.490332e+00
## poly(US.New.Daily.Cases, 5)4
## 3.143398e+01
## poly(US.New.Daily.Cases, 5)5
## 5.054705e+00
## poly(US.New.Daily.Deaths, 3)1
## 3.325179e+00
## poly(US.New.Daily.Deaths, 3)2
## 1.289556e+02
## poly(US.New.Daily.Deaths, 3)3
## 1.390292e+01
## poly(US.New.Daily.Tests, 2)1
## 1.719547e-08
## poly(US.New.Daily.Tests, 2)2
## 6.634159e-07
## I(US.New.Daily.Cases^5)
## 7.017566e+54
## I(US.New.Daily.Deaths^3)
## 1.130145e+25
## I(US.New.Daily.Tests^2)
## 9.224942e+27
## China.New.Daily.Cases
## 4.825236e+09
## China.New.Daily.Deaths
## 1.148448e+07
## poly(US.Unemployment.Rate...., 6)1
## 5.008945e+00
## poly(US.Unemployment.Rate...., 6)2
## 2.878927e+27
## poly(US.Unemployment.Rate...., 6)3
## 9.654434e+25
## poly(US.Unemployment.Rate...., 6)4
## 8.127351e+23
## poly(US.Unemployment.Rate...., 6)5
## 2.614995e+21
## poly(US.Unemployment.Rate...., 6)6
## 2.377530e+18
## poly(US.Quarterly.GDP, 6)1
## 3.155056e+14
## poly(US.Quarterly.GDP, 6)2
## 1.984918e-27
## poly(US.Quarterly.GDP, 6)3
## 2.810640e-02
## poly(US.Quarterly.GDP, 6)4
## 2.609928e-02
## poly(US.Quarterly.GDP, 6)5
## 4.996574e-03
## poly(US.Quarterly.GDP, 6)6
## 4.660478e-03
## I(US.Unemployment.Rate....^6)
## 1.389896e+13
## I(US.Quarterly.GDP^6)
## 1.164150e+02
## Sitting.PresidentDonald Trump
## 7.103104e-66
## Senate.Majority.PartyRepublican
## 1.179825e-72
## SeasonSpring
## 6.361464e-42
## SeasonSummer
## 1.415238e-10
## SeasonWinter
## 5.445415e-62
## I(SPY...Volume^2):I(VIX...Open^5)
## 1.352156e-39
## I(US.New.Daily.Cases^5):I(US.New.Daily.Deaths^3)
## 1.089915e+71
## I(US.New.Daily.Cases^5):I(US.New.Daily.Tests^2)
## 5.100804e+76
## I(US.New.Daily.Deaths^3):I(US.New.Daily.Tests^2)
## 5.550326e+45
## China.New.Daily.Cases:China.New.Daily.Deaths
## 2.657590e+14
## I(US.Unemployment.Rate....^6):I(US.Quarterly.GDP^6)
## 3.691105e+67
## I(US.New.Daily.Cases^5):I(US.New.Daily.Deaths^3):I(US.New.Daily.Tests^2)
## 3.199532e+95
The issue lies in some predictor variables having extremely high VIF values and therefore adjusted_polynomial_interactive_model_4 suffers a bit from multicollinearity. We will use Akaike Information Criterion to improve upon our model to arrive at final_mod.
par(mfrow = c(1, 2))
plot(final_mod, which = 1, col = "dodgerblue", pch = 20)
plot(final_mod, which = 2, col = "dodgerblue", pch = 20)
final_mod does not violate the constant variance, linearity, or normality.
## [1] 0.9865657
## [1] TRUE
## [1] TRUE
## [1] TRUE TRUE
extractAIC(final_mod, k = log(length(resid(final_mod)))) < extractAIC(adjusted_polynomial_interactive_model_4, k = log(length(resid(adjusted_polynomial_interactive_model_4))))## [1] TRUE TRUE
We can see that our final_mod outperforms adjusted_polynomial_interactive_model_4.
From our results we obtain that VIX open, SPY volume, US new daily cases, US new daily deaths, US new daily tests, China new daily cases, China new daily deaths, US unemployment rate, US quarterly gdp, sitting President, Senate majority party, and the seasons are significant predictors for the response of the SPY Closing Price. We omitted VIX Close and the season of fall, since these variables either were dropped in the final step function we used, or caused multiple test statistics to fluctuate (too high of a p-value, R squared value). We also determined that China New Daily Tests were not useful in the creation of various models/plots for our data since there were many sporadic and unnatural patterns within it, so we did not include that either. A certain aspect of our analysis that created cause for concern however, was that it potentially overfits the data, which we determined by observing our predictive model plot that we created. This was potentially caused by the various transformations we performed on our initial model. Our final model is useful to where we are able to use it to predict our response (the SPY Close) for future days where the stock market is open, therefore achieving our goal in finding out which of our predictor variables significantly or insignificantly impacted the stock market in the scope of out dataset.
We chose not to data dump the summaries of each individual regression models here because based off of our evaluations, all of the information pertinent to our project has already been included in the methods section.
Jiesen Zhang (jiesenz2)
Akhil Bhamidipati (akhilsb2)
Raymond Moy (rymoy2)
Ayan Bhowmick (ayanb2)